Metropolis-Hastings MCMC from Scratch¶
https://exowanderer.medium.com/metropolis-hastings-mcmc-from-scratch-in-python-c21e53c485b7
Metropolis-Hastings: Issues with Direct Calculations¶
Metropolis-Hastings (MH) is a common method of constructing an MCMC chain, and it is not too complex to implement or understand. The underlying principle of MCMC is that the chain is a sequence of states created through iterative proposals of new states. States are sets of parameters that define the likelihood. Most commonly, states are the parameters of an analytic model over which the MCMC iterates in order to maximise the posterior.
A Markov chain requires that each proposed (new) state is generated only from the previous state. In the Bayesian context, the acceptance of a new state often involves comparing the likelihood of the proposal state to the likelihood of the current state, with respect to a uniformly sampled acceptance threshold. This last part is often confusing to say out loud, but easy to code.
We will now detail a step-by-step procedure for updating a Markov Chain by comparing the likelihoods between the current state and a proposal (new) state — see previous instalment for details about Bayesian Updating. In our context below, the new state will be accepted if the ratio of the proposal likelihood over the current likelihood is greater than a threshold, randomly sampled from a standard Uniform distribution: U[0, 1).
Metropolis-Hastings for a single parameter system¶
[Note that {Name}, below, symbolizes "sampling the named distribution"; here, the likelihood is Gaussian.]

1. Sample the prior / proposal distribution to start with a random state as the initial parameter(s), and assign it as the current state: x_init = curr = {Prior}
2. Compute the likelihood of the current state: L_curr = N(mu, var)(curr). The likelihood is the function whose posterior is to be sampled. For a Gaussian likelihood, computing the likelihood means inputting x into the following equation: $$ L(x; mu, var) = \frac{1}{\sqrt{2\pi\, var}}\exp\left(-\frac{(x - mu)^2}{2\, var}\right) $$
3. Propose a new state by sampling from the transition (proposal) distribution, i.e., N(curr, stepsize) for a Normal transition distribution: proposed = {N(curr, stepsize)}. Equivalently, proposed = N(curr, var) = curr + N(0, var)
4. Compute the likelihood of the proposed state: L_prop = N(mu, var)(proposed)
5. Compute the acceptance criterion as the ratio of the proposed likelihood over the current likelihood: accept_crit = L_prop / L_curr
6. Sample a uniform distribution from zero to one as the acceptance threshold: accept_threshold = {U[0, 1)}
7. If the acceptance criterion is greater than the acceptance threshold, then store the proposed state as the current state: if accept_crit > accept_threshold: curr = proposed
8. Repeat steps 3–7 until a predetermined number of iterations have been completed or convergence has been achieved.
import numpy as np
import pandas as pd
from plotly import graph_objs as go
from scipy.stats import gaussian_kde
acceptance criterion = L(proposed) / L(current)
If (proposed likelihood / current likelihood) > U(0,1):
then: (accept) current = proposed
Else: (reject) keep the current state.
If the acceptance criterion is small — as in the example above — then there is not much space or “wiggle room” in the uniform probability for the acceptance criterion to be greater than the acceptance threshold.
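To make the "wiggle room" concrete, here is a small hypothetical check: with an acceptance criterion of 0.1, a uniformly sampled threshold falls below it only about 10% of the time, so roughly 90% of such proposals are rejected.

```python
import numpy as np

rng = np.random.default_rng(0)

accept_crit = 0.1  # the proposal likelihood is 10x smaller than the current one
thresholds = rng.uniform(0, 1, size=100_000)

# The proposal is accepted only when the criterion exceeds the threshold
accept_frac = np.mean(accept_crit > thresholds)
print(f"accepted {accept_frac:0.1%} of proposals")  # close to 10%
```

In other words, the acceptance criterion acts as an acceptance probability whenever it is below one.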
import numpy as np

def mcmc_updater(curr_state, curr_likeli, likelihood, proposal_distribution):
    """ Propose a new state and compare the likelihoods

    Args:
        curr_state (float): the current parameter/state value
        curr_likeli (float): the current likelihood estimate
        likelihood (function): a function handle to compute the likelihood
        proposal_distribution (function): a function handle to compute the
            next proposal state

    Returns:
        (tuple): either the current state or the new state
            and its corresponding likelihood
    """
    # Generate a proposal state using the proposal distribution
    # Proposal state == new guess state to be compared to current
    proposal_state = proposal_distribution(curr_state)

    # Calculate the acceptance criterion
    prop_likeli = likelihood(proposal_state)
    accept_crit = prop_likeli / curr_likeli

    # Generate a random number between 0 and 1
    accept_threshold = np.random.uniform(0, 1)

    # If the acceptance criterion is greater than the random number,
    # accept the proposal state as the current state
    if accept_crit > accept_threshold:
        return proposal_state, prop_likeli

    # Else, reject the proposal and keep the current state
    return curr_state, curr_likeli
Metropolis-Hastings from Scratch in Python¶
The following metropolis_hastings routine takes as input the required functions, hyperparameters, and initial conditions, then iterates toward convergence of the MCMC chain over num_samples iterations.
def metropolis_hastings(
        likelihood, proposal_distribution, initial_state,
        num_samples, stepsize=0.5, burnin=0.2):
    """ Compute the Markov Chain Monte Carlo

    Args:
        likelihood (function): a function handle to compute the likelihood
        proposal_distribution (function): a function handle to compute the
            next proposal state
        initial_state (list): The initial conditions to start the chain
        num_samples (integer): The number of samples to compute,
            or length of the chain
        burnin (float): a float value from 0 to 1.
            The percentage of chain considered to be the burnin length

    Returns:
        samples (list): The Markov Chain,
            samples from the posterior distribution
    """
    samples = []

    # The number of samples in the burn in phase
    idx_burnin = int(burnin * num_samples)

    # Set the current state to the initial state
    curr_state = initial_state
    curr_likeli = likelihood(curr_state)

    for i in range(num_samples):
        # The proposal distribution sampling and comparison
        # occur within the mcmc_updater routine
        curr_state, curr_likeli = mcmc_updater(
            curr_state=curr_state,
            curr_likeli=curr_likeli,
            likelihood=likelihood,
            proposal_distribution=proposal_distribution
        )

        # Append the current state to the list of samples
        if i >= idx_burnin:
            # Only append after the burnin to avoid including
            # parts of the chain that are prior-dominated
            samples.append(curr_state)

    return samples
To use the functions above, we need to define the likelihood distribution and the transition (proposal) distribution. In an experimentation context, the likelihood distribution is the distribution of probabilities that the proposed state (i.e., the model parameters) explains the data. To avoid confusion with the discussion above: in this simplified example, the likelihood stands in for the posterior distribution to be sampled, and we simply evaluate the likelihood of each proposed state. In addition, the proposal distribution is a function that generates proposed (new) states for the Markov chain by sampling a Normal distribution centred on the current state.
If we want to establish the likelihood as a standard normal distribution — i.e., a normal distribution with a mean of zero and a standard deviation of one — then our likelihood distribution would be defined as:
def likelihood(x):
    # Standard Normal Distribution
    # An underlying assumption of linear regression is that the residuals
    # are Gaussian Normal Distributed; often, Standard Normal distributed
    return np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
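As a quick sanity check (assuming scipy is available, as it is imported earlier for gaussian_kde), this hand-written density should match scipy's reference normal density at any point:

```python
import numpy as np
from scipy.stats import norm

def likelihood(x):
    # Standard Normal density, as defined above
    return np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

# Compare against scipy's reference implementation at a few points
for x in (0.0, 1.0, -2.5):
    assert np.isclose(likelihood(x), norm.pdf(x))

print(f"{likelihood(0.0):0.4f}")  # peak density: 1 / sqrt(2 pi) ≈ 0.3989
```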
Next, we must define the proposal distribution — i.e., how to generate the next guess. For our direct MH algorithm here, we choose a normal distribution with a mean equal to the current state and a fixed width (the step size): N(curr, var). This is equivalent to adding a zero-mean Normal to the current state:
proposed = N(curr, var) = curr + N(0, var)
Sampling over the (transition) proposal distribution specifies that the next guess (proposed state) is nearby the current state; and that it is randomly selected from a Gaussian (Normal) distribution.
The standard deviation of the proposal distribution is often referred to as the "step size". The step size should be selected based on the scale of the parameter space being probed. For example, if we are measuring the temperature in degrees, then a standard deviation of half-degree jumps is reasonable. In contrast, if we are measuring the height of humans in meters, then a standard deviation of half-meter jumps is too large.
def proposal_distribution(x, stepsize=0.5):
    # Select the proposed state (new guess) from a Gaussian distribution
    # centered at the current state, with a width of `stepsize`
    return np.random.normal(x, stepsize)
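One practical way to judge a step size is to track the acceptance rate. The sketch below is a bare-bones MH loop (the acceptance_rate helper is our own illustration, not part of the routines above) on the standard normal likelihood: tiny steps accept nearly everything but explore slowly, while huge steps are rejected most of the time.

```python
import numpy as np

def likelihood(x):
    # Standard Normal density, as in the article
    return np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

def acceptance_rate(stepsize, num_samples=5000, seed=42):
    # Bare-bones MH loop that counts how often proposals are accepted
    rng = np.random.default_rng(seed)
    curr = 0.0
    curr_likeli = likelihood(curr)
    n_accept = 0
    for _ in range(num_samples):
        proposed = rng.normal(curr, stepsize)
        prop_likeli = likelihood(proposed)
        if prop_likeli / curr_likeli > rng.uniform(0, 1):
            curr, curr_likeli = proposed, prop_likeli
            n_accept += 1
    return n_accept / num_samples

for stepsize in (0.1, 0.5, 5.0):
    print(f"stepsize={stepsize}: acceptance rate ~{acceptance_rate(stepsize):0.2f}")
```

A middling acceptance rate is usually the goal: near 100% means the chain is barely moving, near 0% means nearly every proposal is wasted.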
def compute_kde(samples, x_range=None):
    if x_range is None:
        x_range = min(samples), max(samples)

    g_kde = gaussian_kde(samples)
    x_kde = np.linspace(*x_range, num=100)

    return x_kde, g_kde(x_kde)
def plot_chain(
        samples, burnin=0.2, initial=0.01, nsig=1, fmt='-', y_range=None,
        width=1600, height=800, margins={'l': 20, 'r': 20, 't': 50, 'b': 20}):
    plasma = [
        'rgba(13, 8, 135, 1.0)',
        'rgba(70, 3, 159, 1.0)',
        'rgba(114, 1, 168, 1.0)',
        'rgba(156, 23, 158, 1.0)',
        'rgba(189, 55, 134, 1.0)',
        'rgba(216, 87, 107, 1.0)',
        'rgba(237, 121, 83, 1.0)',
        'rgba(251, 159, 58, 1.0)',
        'rgba(253, 202, 38, 1.0)',
        'rgba(240, 249, 33, 1.0)'
    ]

    num_samples = len(samples)
    idx_burnin = int(num_samples * burnin)
    idx_initial = int(num_samples * initial) + 1
    sample_steps = np.arange(num_samples)
    window = int(0.2 * num_samples)

    df = pd.DataFrame(samples, columns=['samples'])
    df['low_q'] = df['samples'].rolling(
        window=window, center=True, min_periods=0).quantile(quantile=0.05)
    df['high_q'] = df['samples'].rolling(
        window=window, center=True, min_periods=0).quantile(quantile=0.95)

    estimate = np.mean(samples)
    stddev = np.std(samples)
    title = f'The estimate over the chain is: {estimate:0.2f} ± {stddev:0.2f}'

    samples_posterior = samples[idx_burnin:]
    samples_burnin = samples[:idx_burnin]
    samples_initial = samples[:idx_initial]

    if y_range is None:
        std_post = np.std(samples_posterior)
        y_range = min(samples) - nsig * std_post, max(samples) + nsig * std_post

    x_kde_posterior, y_kde_posterior = compute_kde(samples_posterior)
    x_kde_burnin, y_kde_burnin = compute_kde(samples_burnin, x_range=y_range)
    x_kde_initial, y_kde_initial = compute_kde(samples_initial, x_range=y_range)

    kde_trace_posterior = go.Scatter(
        x=y_kde_posterior,
        y=x_kde_posterior,
        mode='lines',
        line={'color': plasma[4], 'width': 2},
        name='Posterior Distribution',
        xaxis="x2",
        yaxis="y2",
        fill="tozerox",
        fillcolor='rgba(100, 0, 100, 0.20)',
    )
    kde_trace_burnin = go.Scatter(
        x=y_kde_burnin,
        y=x_kde_burnin,
        mode='lines',
        line={'color': plasma[6], 'width': 2},
        name='Burnin Distribution',
        xaxis="x2",
        yaxis="y2",
        fill="tozerox",
        fillcolor='rgba(100, 0, 100, 0.20)',
    )
    kde_trace_initial = go.Scatter(
        x=y_kde_initial,
        y=x_kde_initial,
        mode='lines',
        line={'color': plasma[1], 'width': 2},
        name='Initial Distribution',
        xaxis="x2",
        yaxis="y2",
        fill="tozerox",
        fillcolor='rgba(100, 0, 100, 0.20)',
    )

    plots = [
        kde_trace_initial,
        kde_trace_burnin,
        kde_trace_posterior,
        go.Scatter(
            x=sample_steps,
            y=df['low_q'],
            line={'color': 'rgba(255, 0, 0, 0.0)'},
            showlegend=False
        ),
        # Fill between the endpoints of this trace
        # and the endpoints of the trace before it
        go.Scatter(
            x=sample_steps,
            y=df['high_q'],
            line={'color': 'rgba(255, 0, 0, 0.0)'},
            fill="tonextx",
            fillcolor='rgba(100, 0, 100, 0.20)',
            name='Quantile 5 - 95% Region'
        ),
        go.Scatter(
            x=sample_steps[idx_burnin:],
            y=samples_posterior,
            name='Posterior Distribution',
            line={'color': plasma[4]}
        ),
        go.Scatter(
            x=sample_steps[:idx_burnin],
            y=samples_burnin,
            name='Burn-in Region',
            line={'color': plasma[6]}
        ),
        go.Scatter(
            x=sample_steps[:idx_initial],
            y=samples_initial,
            name='Initial Condition Dominated Region',
            line={'color': plasma[1]}
        )
    ]

    layout = go.Layout(
        title=title,
        xaxis={'domain': [0, 0.88], 'showgrid': False},
        xaxis2={'domain': [0.9, 1], 'showgrid': False},
        yaxis={'range': y_range, 'showgrid': False},
        yaxis2={'anchor': 'x2', 'range': y_range, 'showgrid': False},
        width=width,
        height=height,
        margin=margins,
        plot_bgcolor='rgba(255, 255, 255, 1)',
        paper_bgcolor='rgba(255, 255, 255, 1)'
    )

    fig = go.Figure(plots, layout=layout)
    fig.show()
MCMC Estimators and the Burnin Region¶
The MCMC estimator is often taken to be the mean over the chain within the Bayesian credible region of plus or minus the standard deviation of the chain.
$$\mathrm{estimator} = \mathrm{mean}(\mathrm{chain}) \pm \mathrm{std}(\mathrm{chain})$$
Because the initial conditions are set by the prior, we refer to early samples in the chain as prior-dominated. The posterior distribution is, necessarily, a subregion of the prior distribution. If the posterior includes large areas of the prior boundaries, then the prior may be too restrictive or the likelihood may not be well designed to model the data.
Because the posterior is expected to sample a subregion of the prior’s extent — and a well-constrained MCMC chain is expected to iteratively sample the posterior — we tend to only accept the latter 80% of the Markov Chain when computing the estimators: mean +/- standard deviation over the chain. The initial 20% is referred to as the “burn-in” phase, which is often excluded from estimators over the final chain.
As we will see later, estimating after excluding the burn-in is equivalent to:
$$\mathrm{estimator} = \mathrm{mean}(\mathrm{chain}[\mathrm{burnin}:]) \pm \mathrm{std}(\mathrm{chain}[\mathrm{burnin}:])$$
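A hypothetical chain (synthetic here, not a real MCMC run) that drifts in from a poor starting point shows why the burn-in is excluded: the full-chain estimator is biased toward the initial condition, while the post-burn-in estimator recovers the posterior.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical chain: a prior-dominated drift from x=40 toward the mode,
# followed by well-mixed samples from the N(0, 1) posterior
chain = np.concatenate([np.linspace(40, 0, 2_000), rng.normal(0, 1, 8_000)])

burnin = 0.2
idx_burnin = int(burnin * len(chain))
posterior = chain[idx_burnin:]

print(f"full chain:    {chain.mean():0.2f} ± {chain.std():0.2f}")
print(f"after burn-in: {posterior.mean():0.2f} ± {posterior.std():0.2f}")
```

The post-burn-in estimate is close to 0.0 ± 1.0, while the full-chain estimate is pulled several sigma away by the first 20% of samples.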
def plot_chain_ensemble(
        chains, burnin=0.2, alpha=0.25, width=1600, height=800,
        margins={'l': 20, 'r': 20, 't': 50, 'b': 20}):
    plasma = [
        '#0d0887', '#46039f', '#7201a8', '#9c179e', '#bd3786', '#d8576b',
        '#ed7953', '#fb9f3a', '#fdca26', '#f0f921'
    ]

    n_chains = len(chains)
    n_samples = np.min([len(chain_) for chain_ in chains])
    chains = [chain_[:n_samples] for chain_ in chains]
    sample_steps = np.arange(n_samples)
    idx_burnin = int(burnin * n_samples)

    column_names = [f'chain{nc_}' for nc_ in range(n_chains)]
    df = pd.DataFrame(np.transpose(chains), columns=column_names)

    plots = [
        go.Scatter(
            x=df.index,
            y=df[col_],
            name=(
                f"{col_}: {df[col_][df.index > idx_burnin].mean():0.2f}"
                f" ± {df[col_][df.index > idx_burnin].std():0.2f}"
            ),
            line={'color': color_},
        )
        for col_, color_ in zip(df.columns, plasma)
    ]

    for col_, color_ in zip(df.columns, plasma):
        samples_posterior = df[col_]
        x_kde_posterior, y_kde_posterior = compute_kde(samples_posterior)
        plots.append(
            go.Scatter(
                x=y_kde_posterior,
                y=x_kde_posterior,
                mode='lines',
                line={'color': color_, 'width': 2},
                name='Posterior Distribution',
                xaxis="x2",
                yaxis="y2",
                fill="tozerox",
                fillcolor='rgba(100, 0, 100, 0.20)',
            )
        )

    title = 'Multichain Combination of Results. Expected Standard Normal, with 0.0 ± 1.0'
    layout = {
        'title': title,
        'legend': {'orientation': 'h'},
        'xaxis': {'domain': [0, 0.88], 'showgrid': False},
        'xaxis2': {'domain': [0.9, 1], 'showgrid': False},
        'yaxis2': {'anchor': 'x2', 'showgrid': False},
        "margin": margins,
        "width": width,
        "height": height,
        'plot_bgcolor': 'rgba(255, 255, 255, 1)',
        'paper_bgcolor': 'rgba(255, 255, 255, 1)'
    }

    fig = go.Figure(plots, layout=layout)
    fig.show()
Examples of MCMC Cases¶
The initial behaviour of the chain starting at x=40 shows two essential characteristics of MCMC chains that illuminate the burn-in region, and why we exclude it from our final estimate. In particular, at 40 sigma away from the mode of the standard normal likelihood function (i.e., a very poor fit), the chain stays flat for up to 100 steps. This flat behaviour is expected when the initial condition is too far from the mode of a Normal likelihood distribution. We can understand this better by examining the MH algorithm again.
The algorithm evaluates two exponentials when comparing the current and proposal likelihoods. Far from the mode, both can be so close to zero that they are lost in the digital representation. The acceptance criterion then becomes either zero or NaN (zero divided by zero), which offers no wiggle room for the acceptance threshold to accept the proposal: a NaN criterion compares as False against any threshold. This results in no updates for up to 100 proposed samples.
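We can reproduce the underflow directly. At 40 sigma, the Gaussian density evaluates exp(-800), which underflows to exactly 0.0 in double precision, so the likelihood ratio becomes 0/0 = NaN:

```python
import numpy as np

def likelihood(x):
    # Standard Normal density, as defined above
    return np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

print(likelihood(40.0))  # 0.0 -- exp(-800) underflows double precision

with np.errstate(invalid='ignore'):
    accept_crit = likelihood(40.5) / likelihood(40.0)  # 0.0 / 0.0

print(accept_crit)        # nan
print(accept_crit > 0.5)  # False: a NaN criterion rejects every proposal
```

The chain only starts moving once a random walk of proposals happens to wander close enough to the mode for the likelihood to become representable again.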
We narrow in on this example to illuminate that if an initial condition is too far from the maximum likelihood, then the MCMC may not converge, especially within the number of steps defined by the user.
Test the MCMC starting from **far** away from the solution
initial state: `x = 40`
np.random.seed(42)
burnin = 0.2
initial_state = 40
num_samples = int(1e4)
samples = metropolis_hastings(
    likelihood,
    proposal_distribution,
    initial_state,
    num_samples,
    burnin=0.0
)
plot_chain(samples, width=800, height=400, burnin=burnin, initial=burnin/4)
/opt/conda/lib/python3.7/site-packages/ipykernel_launcher.py:23: RuntimeWarning: invalid value encountered in double_scalars
/opt/conda/lib/python3.7/site-packages/ipykernel_launcher.py:23: RuntimeWarning: divide by zero encountered in double_scalars
Test the MCMC starting near the solution
initial state: `x = 5`
np.random.seed(42)
burnin = 0.2
initial_state = 5
num_samples = int(1e4)
samples = metropolis_hastings(
    likelihood,
    proposal_distribution,
    initial_state,
    num_samples,
    burnin=0.0
)
plot_chain(samples, burnin=burnin, width=800, height=400, initial=burnin/10, nsig=1)
The chain is configured to measure the distance between the state and the mode of the likelihood, within the prior. For simplicity, this example started at an initial state of x=5, near the solution.
The chain probed the posterior, which is — by design here — a standard normal: N(0,1). The right-hand side of the figure above exhibits three density estimates over each section of the chain: initial, burn-in, and posterior. The density estimates evolve from too tight (initial: purple) to a close approximation of a standard normal (posterior: magenta).
Test the MCMC starting at the solution
initial state: `x = 0`
np.random.seed(42)
burnin = 0.2
initial_state = 0
num_samples = int(1e4)
samples = metropolis_hastings(
    likelihood,
    proposal_distribution,
    initial_state,
    num_samples,
    burnin=0.0
)
plot_chain(samples, width=800, height=400, burnin=burnin, initial=burnin/10)
Run the MCMC, starting at the solution, for 10x longer
initial state: `x = 0`
num samples: `num_samples = 1e5`
np.random.seed(42)
burnin = 0.2
initial_state = 0
num_samples = int(1e5)
samples = metropolis_hastings(
    likelihood,
    proposal_distribution,
    initial_state,
    num_samples,
    burnin=0.0
)
plot_chain(samples, width=800, height=400, burnin=burnin, initial=burnin/10, fmt='o')
Ensemble MCMC - Simple¶
Because the initial conditions may cause the chain to fail to converge, it is often very useful to track multiple chains, randomly sampling over the prior distribution to establish each chain's initial conditions.
The simple ensemble method below runs multiple, independent experiments that probe the same likelihood and posterior distributions. The overlapping convergence of multiple MCMC chains, each from a randomly selected initial condition, provides greater evidence that the specific model fits the data well.
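The idea can be sketched with synthetic stand-ins for converged chains (hypothetical data, not output of the routines above): if every chain samples the same posterior, their post-burn-in means should agree far more tightly than the posterior width itself.

```python
import numpy as np

rng = np.random.default_rng(7)

# Synthetic stand-ins for ten converged chains, each sampling N(0, 1)
chains = [rng.normal(0, 1, 1_000) for _ in range(10)]

idx_burnin = 200  # 20% burn-in
means = [chain[idx_burnin:].mean() for chain in chains]

# The between-chain spread should be much smaller than the posterior width (1.0)
print(f"per-chain means spread: {np.std(means):0.3f}")
```

A chain whose post-burn-in mean sits far outside this spread is a sign that it never converged and should be inspected or discarded.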
np.random.seed(0)
burnin = 0.2
num_samples = int(1e3)
n_chains = 10
chains = [
    metropolis_hastings(
        likelihood=likelihood,
        proposal_distribution=proposal_distribution,
        initial_state=np.random.normal(0, 20),
        num_samples=num_samples,
        burnin=0.0
    )
    for _ in range(n_chains)
]
/opt/conda/lib/python3.7/site-packages/ipykernel_launcher.py:23: RuntimeWarning: invalid value encountered in double_scalars
/opt/conda/lib/python3.7/site-packages/ipykernel_launcher.py:23: RuntimeWarning: divide by zero encountered in double_scalars
The figure above shows 10 uniquely initialised MCMC chains probing the same likelihood and posterior distributions. Note that one of the chains exhibits the initially flat behaviour caused by a vanishingly small initial likelihood evaluation.
plot_chain_ensemble(chains, burnin=0.2, alpha=0.5, width=1600, height=800, margins={'l':20, 'r':20, 't':50, 'b':20})